Wine Quality Analysis¶

Created 17-Oct-2024 by Mark A. Goforth, Ph.D.¶

Purpose¶

This notebook performs exploratory data analysis (EDA) and trains a deep neural network (DNN) to estimate the quality of a wine from its chemical composition.

Goal¶

Challenges & Discussion¶

General Steps for Approach¶

  1. Download data

    • the wine quality dataset is downloaded from Kaggle
  2. EDA

    • identify the independent variables that influence the outcome
  3. Feature Engineering

    • normalize and standardize independent variables as necessary
    • reduce dimensionality
  4. Train/Test Split

    • split the data into a training set and a held-out test set to estimate real-world performance
    • use a random shuffle and a stratified split to preserve class proportions (see the sketch after this list)
  5. Model Selection, Cross Validation, and Tuning

    • use K-fold cross-validation to reduce bias, build a more generalized model, and prevent overfitting
    • apply hyperparameter tuning to search for the settings that best balance bias and variance
  6. Model Validation

    • run the model on the test set to estimate how it will perform on real-world data
  7. Create GAN (TBD)

    • create a Generative Adversarial Network (GAN) deep learning architecture
    • train two neural networks against each other to generate authentic-looking new data from the training dataset
  8. Create VAE (TBD)

    • create a Variational Autoencoder (VAE) deep learning architecture
    • train a neural network for use in anomaly detection
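
A minimal sketch of the stratified split called for in step 4 (variable names are illustrative; the split cell later in this notebook omits stratify):

from sklearn.model_selection import train_test_split

# stratify on the quality labels so train and test keep the same class mix
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, stratify=df.quality, random_state=0)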

Conclusion¶

In [ ]:
# install any necessary python packages
!pip install kagglehub
In [ ]:
!pip install tensorflow
In [ ]:
!pip install keras_tuner

Import Libraries¶

In [1]:
import datetime
import time
import pickle

import numpy as np 
import pandas as pd 
import matplotlib.pyplot as plt 
import seaborn as sns 
import scipy.stats as stats
import statsmodels.api as sm

from matplotlib import rcParams
from IPython.display import Image, HTML

import sklearn
from sklearn import decomposition, metrics
from sklearn.decomposition import PCA
from sklearn.preprocessing import MinMaxScaler
from sklearn.model_selection import train_test_split, GridSearchCV, cross_val_score, KFold
from sklearn.metrics import confusion_matrix

import kagglehub
import ppscore as pps

import tensorflow as tf
from tensorflow.keras import layers, models, losses

import keras
import keras_tuner
2024-10-20 12:28:59.926457: I tensorflow/core/platform/cpu_feature_guard.cc:210] This TensorFlow binary is optimized to use available CPU instructions in performance-critical operations.
To enable the following instructions: AVX2 FMA, in other operations, rebuild TensorFlow with the appropriate compiler flags.

Download latest dataset version¶

In [2]:
pathstr = kagglehub.dataset_download("adarshde/wine-quality-dataset")
print("Path to dataset files:", pathstr)
df = pd.read_csv(pathstr+'/winequality-dataset_updated.csv')
df = df.drop_duplicates()
Path to dataset files: /Users/Mark/.cache/kagglehub/datasets/adarshde/wine-quality-dataset/versions/3

Exploratory Data Analysis (EDA)¶

In [3]:
df.head()
Out[3]:
fixed acidity volatile acidity citric acid residual sugar chlorides free sulfur dioxide total sulfur dioxide density pH sulphates alcohol quality
0 7.3 0.70 0.00 1.9 0.076 11.0 34.0 0.9978 3.51 0.56 9.4 5
1 7.8 0.88 0.00 2.6 0.098 25.0 67.0 0.9968 3.20 0.68 9.8 5
2 7.8 0.76 0.04 2.3 0.092 15.0 54.0 0.9970 3.26 0.65 9.8 5
3 11.2 0.28 0.56 1.9 0.075 17.0 60.0 0.9980 3.16 0.58 9.8 6
4 7.2 0.70 0.00 1.9 0.076 11.0 34.0 0.9978 3.51 0.56 9.4 5
In [4]:
df.info()
<class 'pandas.core.frame.DataFrame'>
Int64Index: 1760 entries, 0 to 1998
Data columns (total 12 columns):
 #   Column                Non-Null Count  Dtype  
---  ------                --------------  -----  
 0   fixed acidity         1760 non-null   float64
 1   volatile acidity      1760 non-null   float64
 2   citric acid           1760 non-null   float64
 3   residual sugar        1760 non-null   float64
 4   chlorides             1760 non-null   float64
 5   free sulfur dioxide   1760 non-null   float64
 6   total sulfur dioxide  1760 non-null   float64
 7   density               1760 non-null   float64
 8   pH                    1760 non-null   float64
 9   sulphates             1760 non-null   float64
 10  alcohol               1760 non-null   float64
 11  quality               1760 non-null   int64  
dtypes: float64(11), int64(1)
memory usage: 178.8 KB
In [5]:
df.describe().T.style.background_gradient(axis=0)
Out[5]:
  count mean std min 25% 50% 75% max
fixed acidity 1760.000000 8.710455 2.293976 4.600000 7.100000 8.000000 10.000000 15.900000
volatile acidity 1760.000000 0.545045 0.183404 0.120000 0.400000 0.540000 0.660000 1.580000
citric acid 1760.000000 0.244261 0.180000 0.000000 0.110000 0.190000 0.380000 1.000000
residual sugar 1760.000000 3.844392 3.424476 0.900000 2.000000 2.400000 3.800000 15.990000
chlorides 1760.000000 0.074782 0.050203 0.010000 0.050000 0.074000 0.086000 0.611000
free sulfur dioxide 1760.000000 20.788636 16.118756 1.000000 9.000000 16.000000 28.000000 72.000000
total sulfur dioxide 1760.000000 53.722443 37.795090 6.000000 24.000000 44.000000 75.000000 289.000000
density 1760.000000 0.996411 0.002118 0.990070 0.995200 0.996550 0.997800 1.003690
pH 1760.000000 3.286381 0.286839 2.340000 3.170000 3.300000 3.420000 4.160000
sulphates 1760.000000 0.989398 0.821606 0.330000 0.560000 0.660000 0.870000 3.990000
alcohol 1760.000000 10.711487 1.411144 8.400000 9.500000 10.400000 11.500000 15.000000
quality 1760.000000 5.627841 1.312301 2.000000 5.000000 6.000000 6.000000 9.000000

Attribute Information¶

Feature Explanation
fixed acidity most acids involved with wine are fixed or nonvolatile
volatile acidity the amount of acetic acid in wine
citric acid the amount of citric acid in wine
residual sugar the amount of sugar remaining after fermentation stops
chlorides the amount of salt in the wine
free sulfur dioxide the amount of free sulfur dioxide in the wine (the forms available to react, which exhibit both germicidal and antioxidant properties)
total sulfur dioxide the amount of free and bound forms of SO2
density the density of the wine, which depends on its alcohol and sugar content
pH describes how acidic or basic a wine is on a scale from 0 (very acidic) to 14 (very basic); most wines are between 3 and 4
alcohol the percent alcohol content of the wine
quality output variable (based on sensory data; scores in this dataset range from 2 to 9)

check for missing values¶

In [7]:
df.isna().sum()
Out[7]:
fixed acidity           0
volatile acidity        0
citric acid             0
residual sugar          0
chlorides               0
free sulfur dioxide     0
total sulfur dioxide    0
density                 0
pH                      0
sulphates               0
alcohol                 0
quality                 0
dtype: int64

Visualization - create histograms for each independent variable¶

In [8]:
for i in df.columns:
    plt.figure(figsize=(6, 4)) 
    sns.histplot(data=df[i])
    plt.title(f'{i}')
    plt.tight_layout()
    plt.show()
[12 figures: one histogram per column]

Visualization - create box plots¶

In [9]:
columns = list(df.columns)
fig, ax = plt.subplots(11, 2, figsize=(15, 45))
plt.subplots_adjust(hspace=0.5)
for i in range(11):
    # left axis: box plot of the feature distribution
    sns.boxplot(x=columns[i], data=df, ax=ax[i, 0])
    # right axis: feature vs. quality scatter plot
    sns.scatterplot(x=columns[i], y='quality', data=df, hue='quality', ax=ax[i, 1])
[figure: 11×2 grid of box plots (left) and quality scatter plots (right)]

compare each independent variable with quality using box plots¶

In [11]:
for i in df.columns:
    if i != 'quality':
        plt.figure(figsize=(6, 4))  # Set figure size for each plot
        sns.boxplot(data=df, x='quality', y= i)
        plt.title(f'Box plot for quality and {i}')
        plt.tight_layout()
        plt.show()
[11 figures: box plot of each feature grouped by quality]
In [12]:
for i in df.columns:
    if i != 'quality':
        plt.figure(figsize=(6, 4))
        sns.violinplot(data=df, x='quality', y=i)
        plt.title(f'Violin plot for {i} by Quality')
        plt.tight_layout()
        plt.show()
[11 figures: violin plot of each feature grouped by quality]

Correlate each independent variable with quality¶

In [13]:
%matplotlib inline
rcParams['figure.figsize'] = 12, 10
sns.set_style('whitegrid')
In [14]:
# Plotting the correlation heatmap
dataplot = sns.heatmap(df.corr(), cmap="YlGnBu", annot=True, annot_kws={"size": 12})

# Displaying heatmap
plt.show()
[figure: correlation heatmap]
In [15]:
rcParams['figure.figsize'] = 15, 15
sns.pairplot(df, hue='quality', corner = True, palette='Blues')
Out[15]:
<seaborn.axisgrid.PairGrid at 0x177f31eb0>
[figure: pair plot colored by quality]
In [16]:
# plot each feature's correlation with quality, sorted, as a horizontal bar chart
dfc = df.corr().iloc[:-1, -1:].sort_values(by='quality', ascending=True)
dfc.plot.barh(figsize=(10, 4))
Out[16]:
<Axes: >
[figure: horizontal bar chart of each feature's correlation with quality]

Prepare data for machine learning training¶

In [17]:
X = df.drop('quality', axis=1)
variable_names = X.columns
In [18]:
variable_names
Out[18]:
Index(['fixed acidity', 'volatile acidity', 'citric acid', 'residual sugar',
       'chlorides', 'free sulfur dioxide', 'total sulfur dioxide', 'density',
       'pH', 'sulphates', 'alcohol'],
      dtype='object')
In [19]:
X.head()
Out[19]:
fixed acidity volatile acidity citric acid residual sugar chlorides free sulfur dioxide total sulfur dioxide density pH sulphates alcohol
0 7.3 0.70 0.00 1.9 0.076 11.0 34.0 0.9978 3.51 0.56 9.4
1 7.8 0.88 0.00 2.6 0.098 25.0 67.0 0.9968 3.20 0.68 9.8
2 7.8 0.76 0.04 2.3 0.092 15.0 54.0 0.9970 3.26 0.65 9.8
3 11.2 0.28 0.56 1.9 0.075 17.0 60.0 0.9980 3.16 0.58 9.8
4 7.2 0.70 0.00 1.9 0.076 11.0 34.0 0.9978 3.51 0.56 9.4
In [20]:
pca = decomposition.PCA()
wine_pca = pca.fit_transform(X)
explained_variance = pca.explained_variance_ratio_
In [21]:
comps = pd.DataFrame(pca.components_, columns=variable_names)
comps
Out[21]:
fixed acidity volatile acidity citric acid residual sugar chlorides free sulfur dioxide total sulfur dioxide density pH sulphates alcohol
0 0.002627 0.000503 -0.000344 0.028497 -0.000205 0.244696 0.969158 -0.000004 -0.000523 0.005899 0.001288
1 0.016668 0.000230 -0.002320 0.081072 -0.000844 0.965255 -0.246279 -0.000019 -0.001673 0.017862 0.021218
2 0.321359 0.000777 0.000026 0.928018 -0.004029 -0.089868 -0.006305 0.000001 -0.011790 0.109290 0.123671
3 -0.943642 0.011013 -0.033403 0.321132 -0.000738 -0.012454 -0.003658 -0.000299 0.027195 -0.027042 0.059466
4 0.009065 -0.011031 0.004401 -0.146875 -0.007204 -0.010019 0.004919 -0.000524 0.001574 0.099087 0.983975
5 -0.061430 0.020873 -0.042513 -0.081280 -0.005116 -0.008396 -0.001221 -0.000298 -0.012820 0.987354 -0.110666
6 0.035398 0.118715 -0.174634 -0.000466 -0.022120 0.000611 0.000135 -0.000363 0.976547 0.004652 -0.000471
7 0.023222 0.787008 -0.580439 -0.008426 -0.028641 -0.001187 -0.000008 -0.001059 -0.200759 -0.042043 0.014280
8 -0.017816 0.602446 0.783734 0.004263 0.130421 0.001325 -0.000519 0.003855 0.070416 0.022155 0.002678
9 0.004173 -0.053869 -0.124076 0.001785 0.990743 0.000187 0.000063 0.003171 0.006645 0.002232 0.007200
10 0.000203 0.001271 0.003326 0.000036 0.003688 -0.000001 -0.000002 -0.999987 0.000145 -0.000203 -0.000483
In [23]:
rcParams['figure.figsize'] = 10, 10
sns.heatmap(comps, cmap='Blues', annot=True )
Out[23]:
<Axes: >
[figure: heatmap of PCA component loadings]
In [24]:
# Plot the explained variance of the top N components, labeling each
# component with the feature that loads most heavily on it
maxcol = np.argmax(pca.components_, axis=1)
n_components = 5  # number of top components to display
rcParams['figure.figsize'] = 10, 4
plt.bar(range(n_components), explained_variance[:n_components])
plt.xlabel('Principal Component')
plt.ylabel('Explained Variance Ratio')
plt.title('Top {} Principal Components'.format(n_components))
plt.xticks(np.arange(n_components), variable_names[maxcol[:n_components]])
plt.show()
[figure: explained variance ratio of the top 5 principal components]
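
Each principal component above is dominated by a single feature (e.g. total sulfur dioxide on component 0) because PCA was fit on the unscaled data, so the highest-variance columns carry the first components. A hedged sketch of the standardized alternative, not run in this notebook:

from sklearn.preprocessing import StandardScaler
from sklearn.decomposition import PCA

# standardize to zero mean / unit variance first so components reflect
# correlation structure rather than raw feature scale
X_std = StandardScaler().fit_transform(X)
pca_std = PCA().fit(X_std)
print(pca_std.explained_variance_ratio_.round(3))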
In [25]:
# compute each feature's predictive power score (pps) for quality
ppscore_list = [pps.score(df, colName, 'quality') for colName in variable_names]
df_pp_score = pd.DataFrame(ppscore_list).sort_values('ppscore', ascending=False)
In [26]:
df_pp_score
Out[26]:
x y ppscore case is_valid_score metric baseline_score model_score model
10 alcohol quality 0.004587 regression True mean absolute error 0.973295 0.968831 DecisionTreeRegressor()
0 fixed acidity quality 0.000000 regression True mean absolute error 0.973295 1.000633 DecisionTreeRegressor()
1 volatile acidity quality 0.000000 regression True mean absolute error 0.973295 1.000094 DecisionTreeRegressor()
2 citric acid quality 0.000000 regression True mean absolute error 0.973295 0.975877 DecisionTreeRegressor()
3 residual sugar quality 0.000000 regression True mean absolute error 0.973295 1.129688 DecisionTreeRegressor()
4 chlorides quality 0.000000 regression True mean absolute error 0.973295 1.010453 DecisionTreeRegressor()
5 free sulfur dioxide quality 0.000000 regression True mean absolute error 0.973295 1.018520 DecisionTreeRegressor()
6 total sulfur dioxide quality 0.000000 regression True mean absolute error 0.973295 1.036547 DecisionTreeRegressor()
7 density quality 0.000000 regression True mean absolute error 0.973295 1.173801 DecisionTreeRegressor()
8 pH quality 0.000000 regression True mean absolute error 0.973295 1.065762 DecisionTreeRegressor()
9 sulphates quality 0.000000 regression True mean absolute error 0.973295 1.053535 DecisionTreeRegressor()
In [27]:
# bar chart of predictive power scores (log x-scale; zero scores do not render)
ax = df_pp_score.plot.barh(x='x', y='ppscore')
ax.set_xscale('log')
[figure: predictive power scores, log scale]

normalize data¶

In [28]:
# Create X from DataFrame and y as Target
X_temp = df.drop(columns='quality')
y = df.quality
In [29]:
# scale all features to the [0, 1] range
X_scaled = MinMaxScaler(feature_range=(0, 1)).fit_transform(X_temp)
X = pd.DataFrame(X_scaled, columns=X_temp.columns)
X.describe().T.style.background_gradient(axis=0, cmap='Blues')
Out[29]:
  count mean std min 25% 50% 75% max
fixed acidity 1760.000000 0.363757 0.203007 0.000000 0.221239 0.300885 0.477876 1.000000
volatile acidity 1760.000000 0.291127 0.125619 0.000000 0.191781 0.287671 0.369863 1.000000
citric acid 1760.000000 0.244261 0.180000 0.000000 0.110000 0.190000 0.380000 1.000000
residual sugar 1760.000000 0.195122 0.226937 0.000000 0.072896 0.099404 0.192180 1.000000
chlorides 1760.000000 0.107791 0.083533 0.000000 0.066556 0.106489 0.126456 1.000000
free sulfur dioxide 1760.000000 0.278713 0.227025 0.000000 0.112676 0.211268 0.380282 1.000000
total sulfur dioxide 1760.000000 0.168631 0.133552 0.000000 0.063604 0.134276 0.243816 1.000000
density 1760.000000 0.465591 0.155540 0.000000 0.376652 0.475771 0.567548 1.000000
pH 1760.000000 0.519989 0.157604 0.000000 0.456044 0.527473 0.593407 1.000000
sulphates 1760.000000 0.180163 0.224483 0.000000 0.062842 0.090164 0.147541 1.000000
alcohol 1760.000000 0.350225 0.213810 0.000000 0.166667 0.303030 0.469697 1.000000
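
Note that the scaler above is fit on the full dataset before the train/test split below, which leaks test-set statistics into the scaling. A leakage-free variant (a sketch with illustrative names) fits the scaler on the training rows only:

# split first, then fit the scaler on the training split only
X_tr, X_te, y_tr, y_te = train_test_split(X_temp, y, test_size=0.25, random_state=0)
mms = MinMaxScaler().fit(X_tr)
X_tr_s, X_te_s = mms.transform(X_tr), mms.transform(X_te)

For min-max scaling of bounded chemical measurements the leakage is mild, but the split-first pattern generalizes safely to other preprocessing.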

total count for each label¶

In [30]:
df.quality.value_counts()
Out[30]:
5    632
6    575
7    233
4     98
3     60
9     60
8     58
2     44
Name: quality, dtype: int64
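
The labels are heavily imbalanced (632 fives vs. 44 twos). One hedged option, not used in this notebook, is to weight classes inversely to their frequency and pass the result to fit via class_weight:

# sketch: inverse-frequency ("balanced") class weights keyed by quality score
counts = df.quality.value_counts()
class_weight = {c: len(df) / (len(counts) * n) for c, n in counts.items()}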
In [31]:
# Convert labels to one-hot encoding; with a maximum quality of 9 this yields
# a (n_samples, 10) matrix, matching the 10-unit softmax output layer below
y = tf.keras.utils.to_categorical(y)
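
For reference, a tiny illustration with toy labels: to_categorical sizes the matrix to max(label) + 1 columns, so quality scores up to 9 yield 10 columns.

demo = tf.keras.utils.to_categorical([2, 5, 9])
print(demo.shape)   # (3, 10)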
In [32]:
# Split DataFrame into 75% train / 25% test
# (step 4 of the plan calls for a stratified split; that would be
# stratify=df.quality here)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.25, random_state=0)
In [33]:
y_train
Out[33]:
array([[0., 0., 0., ..., 0., 0., 0.],
       [0., 0., 0., ..., 0., 0., 0.],
       [0., 0., 0., ..., 0., 0., 0.],
       ...,
       [0., 0., 0., ..., 0., 0., 1.],
       [0., 0., 0., ..., 1., 0., 0.],
       [0., 0., 0., ..., 0., 0., 0.]])
In [34]:
#------------------------------------------------------------------------------
# hyperparameter tuning function
#------------------------------------------------------------------------------
def build_model(hp):

    print(hp)

    # hyperparameters to tune (these default ranges are overridden by the
    # HyperParameters object passed to the tuner below)
    n_layers = hp.Int( "n_layers", 2, 24 )
    nodeunits = hp.Int( 'units', 4, 32 )
    dropout = hp.Float( "dropout", 0, 0.25 )
    learning_rate = hp.Float( "learning_rate", 0.00001, 10 )
    optimizer = hp.Choice( "optimizer", ["adam", "adamax"] )

    #--------------------------------------
    # configure model
    #--------------------------------------

    #	number of hidden layers
    #	number of neurons per layer
    #	hidden activation function (relu)
    #	output layer (softmax for multiclass; sigmoid would suit binary classification)

    # initialize ANN
    ann = tf.keras.models.Sequential()

    # add hidden layers
    for i in range(n_layers):
        ann.add(tf.keras.layers.Dense(nodeunits, activation="relu"))

    # NOTE: the dropout rate is tuned but is not wired into this model (the
    # recorded trials below ran without dropout layers); to apply it, add
    # tf.keras.layers.Dropout(dropout) after each hidden layer above

    # create output layer (number of units = number of one-hot classes)
    ann.add(tf.keras.layers.Dense(units=10, activation="softmax"))

    # compile model
    # optimizers tried: Adam (good), Adamax (good), SGD (good but takes a lot
    # of epochs), Lion (very noisy loss), Ftrl (bad)
    if optimizer == "adamax":
        opt = tf.keras.optimizers.Adamax(learning_rate)
    else:
        opt = tf.keras.optimizers.Adam(learning_rate)

    # loss: categorical_crossentropy for one-hot multiclass labels
    # metrics: accuracy
    ann.compile(optimizer=opt, loss="categorical_crossentropy", metrics=['accuracy'])

    return ann
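
A quick sanity check of the hypermodel (a sketch; hp_demo and model_demo are illustrative names): build one model from the default hyperparameter values and print its summary.

hp_demo = keras_tuner.HyperParameters()
model_demo = build_model(hp_demo)
model_demo.build(input_shape=(None, 11))  # 11 chemical features
model_demo.summary()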
In [35]:
# train DNN model

# run start time
print("start time: "+str(datetime.datetime.now()))
starttime = time.time()

numtrials = 200
numepochs = 25

hp = keras_tuner.HyperParameters()

# define the search space; these ranges override the defaults in build_model
hp.Float(
    "learning_rate",
    min_value=0.0001,
    max_value=0.1,
    sampling="log" )

hp.Int(
    "n_layers",
    min_value=2,
    max_value=4 )

hp.Int(
    "units",
    min_value=11,
    max_value=11 )

hp.Float(
    "dropout",
    min_value=0.0,
    max_value=0.05 )

# note: batch_size is sampled by the tuner but is not consumed by
# build_model or passed to fit, so it has no effect on these trials
hp.Int(
    "batch_size",
    min_value=4,
    max_value=32 )

hp.Choice(
    "optimizer",
     ["adam"] )

# hyperparameter tuning
dts = str(datetime.datetime.now().isoformat(timespec="seconds"))
dts = dts.replace(":","")
pathout = "./tuner_"+dts
print("output path: "+pathout)
# tuner = keras_tuner.RandomSearch(
tuner = keras_tuner.BayesianOptimization(
    build_model,
    objective='val_loss', # val_accuracy val_loss
    max_trials=numtrials,
    directory=pathout,
    hyperparameters=hp )

tuner.search( X_train, y_train, epochs=numepochs, validation_data=(X_test,y_test))
tuner.search_space_summary()
tuner.results_summary()

print( datetime.datetime.now().strftime("%Y-%m-%d %H:%M:%S") + "  runtime: " + str(round(time.time()-starttime,3)) + " seconds" )
Trial 200 Complete [00h 00m 06s]
val_loss: 1.4633156061172485

Best val_loss So Far: 1.1767619848251343
Total elapsed time: 00h 17m 25s
Search space summary
Default search space size: 6
learning_rate (Float)
{'default': 0.0001, 'conditions': [], 'min_value': 0.0001, 'max_value': 0.1, 'step': None, 'sampling': 'log'}
n_layers (Int)
{'default': None, 'conditions': [], 'min_value': 2, 'max_value': 4, 'step': 1, 'sampling': 'linear'}
units (Int)
{'default': None, 'conditions': [], 'min_value': 11, 'max_value': 11, 'step': 1, 'sampling': 'linear'}
dropout (Float)
{'default': 0.0, 'conditions': [], 'min_value': 0.0, 'max_value': 0.05, 'step': None, 'sampling': 'linear'}
batch_size (Int)
{'default': None, 'conditions': [], 'min_value': 4, 'max_value': 32, 'step': 1, 'sampling': 'linear'}
optimizer (Choice)
{'default': 'adam', 'conditions': [], 'values': ['adam'], 'ordered': False}
Results summary
Results in ./tuner_2024-10-20T123331/untitled_project
Showing 10 best trials
Objective(name="val_loss", direction="min")

Trial 094 summary
Hyperparameters:
learning_rate: 0.01313017789171866
n_layers: 3
units: 11
dropout: 0.019680902684635102
batch_size: 18
optimizer: adam
Score: 1.1767619848251343

Trial 077 summary
Hyperparameters:
learning_rate: 0.037046280053398446
n_layers: 3
units: 11
dropout: 0.013766438665671033
batch_size: 16
optimizer: adam
Score: 1.178104281425476

Trial 064 summary
Hyperparameters:
learning_rate: 0.014235594870374568
n_layers: 3
units: 11
dropout: 0.014699279876672576
batch_size: 16
optimizer: adam
Score: 1.180485725402832

Trial 174 summary
Hyperparameters:
learning_rate: 0.03505231393488723
n_layers: 4
units: 11
dropout: 0.028969137670465796
batch_size: 19
optimizer: adam
Score: 1.180999517440796

Trial 048 summary
Hyperparameters:
learning_rate: 0.042793137162632534
n_layers: 3
units: 11
dropout: 0.0067480563266899985
batch_size: 20
optimizer: adam
Score: 1.1811329126358032

Trial 041 summary
Hyperparameters:
learning_rate: 0.03990196699511094
n_layers: 3
units: 11
dropout: 0.010040175924648448
batch_size: 21
optimizer: adam
Score: 1.181243896484375

Trial 005 summary
Hyperparameters:
learning_rate: 0.0251291873419375
n_layers: 2
units: 11
dropout: 0.007240227956324658
batch_size: 13
optimizer: adam
Score: 1.1818410158157349

Trial 083 summary
Hyperparameters:
learning_rate: 0.04050796566630129
n_layers: 3
units: 11
dropout: 0.016593980314083344
batch_size: 15
optimizer: adam
Score: 1.182966947555542

Trial 136 summary
Hyperparameters:
learning_rate: 0.06017700513596118
n_layers: 3
units: 11
dropout: 0.014104510639885154
batch_size: 28
optimizer: adam
Score: 1.1832951307296753

Trial 069 summary
Hyperparameters:
learning_rate: 0.019199806108826153
n_layers: 2
units: 11
dropout: 0.010091488143066733
batch_size: 17
optimizer: adam
Score: 1.1835540533065796
2024-10-20 12:50:56  runtime: 1045.087 seconds
In [36]:
# return the best hyperparameters
best_hp = tuner.get_best_hyperparameters()[0]
ann = tuner.hypermodel.build(best_hp)
<keras_tuner.src.engine.hyperparameters.hyperparameters.HyperParameters object at 0x1870c60c0>
In [37]:
# select the best model
best_model = tuner.get_best_models()[0]
best_model.summary()
<keras_tuner.src.engine.hyperparameters.hyperparameters.HyperParameters object at 0x1870c60c0>
/opt/anaconda3/lib/python3.12/site-packages/keras/src/saving/saving_lib.py:719: UserWarning: Skipping variable loading for optimizer 'adam', because it has 2 variables whereas the saved optimizer has 18 variables. 
  saveable.load_own_variables(weights_store.get(inner_path))
Model: "sequential"
┏━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━┓
┃ Layer (type)                    ┃ Output Shape           ┃       Param # ┃
┡━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━┩
│ dense (Dense)                   │ (None, 11)             │           132 │
├─────────────────────────────────┼────────────────────────┼───────────────┤
│ dense_1 (Dense)                 │ (None, 11)             │           132 │
├─────────────────────────────────┼────────────────────────┼───────────────┤
│ dense_2 (Dense)                 │ (None, 11)             │           132 │
├─────────────────────────────────┼────────────────────────┼───────────────┤
│ dense_3 (Dense)                 │ (None, 10)             │           120 │
└─────────────────────────────────┴────────────────────────┴───────────────┘
 Total params: 516 (2.02 KB)
 Trainable params: 516 (2.02 KB)
 Non-trainable params: 0 (0.00 B)
In [38]:
numepochs = 30
fitout = ann.fit( X_train, y_train, validation_data=(X_test,y_test), epochs=numepochs )

# save model
modelfilename = "ANN.keras"
ann.save(modelfilename)

# load model from file
# ann = models.load_model(modelfilename)

# print metrics
ann.summary()
Epoch 1/30
42/42 ━━━━━━━━━━━━━━━━━━━━ 1s 5ms/step - accuracy: 0.2670 - loss: 1.9892 - val_accuracy: 0.3659 - val_loss: 1.4808
Epoch 2/30
42/42 ━━━━━━━━━━━━━━━━━━━━ 0s 2ms/step - accuracy: 0.3236 - loss: 1.5679 - val_accuracy: 0.3636 - val_loss: 1.3920
Epoch 3/30
42/42 ━━━━━━━━━━━━━━━━━━━━ 0s 1ms/step - accuracy: 0.3861 - loss: 1.4584 - val_accuracy: 0.4136 - val_loss: 1.3231
Epoch 4/30
42/42 ━━━━━━━━━━━━━━━━━━━━ 0s 1ms/step - accuracy: 0.4487 - loss: 1.3159 - val_accuracy: 0.4614 - val_loss: 1.2638
Epoch 5/30
42/42 ━━━━━━━━━━━━━━━━━━━━ 0s 1ms/step - accuracy: 0.4401 - loss: 1.3395 - val_accuracy: 0.4455 - val_loss: 1.3174
Epoch 6/30
42/42 ━━━━━━━━━━━━━━━━━━━━ 0s 1ms/step - accuracy: 0.4725 - loss: 1.3255 - val_accuracy: 0.4705 - val_loss: 1.2288
Epoch 7/30
42/42 ━━━━━━━━━━━━━━━━━━━━ 0s 1ms/step - accuracy: 0.4831 - loss: 1.3172 - val_accuracy: 0.4886 - val_loss: 1.2574
Epoch 8/30
42/42 ━━━━━━━━━━━━━━━━━━━━ 0s 1ms/step - accuracy: 0.4457 - loss: 1.3168 - val_accuracy: 0.4795 - val_loss: 1.2465
Epoch 9/30
42/42 ━━━━━━━━━━━━━━━━━━━━ 0s 1ms/step - accuracy: 0.4860 - loss: 1.2478 - val_accuracy: 0.4864 - val_loss: 1.2369
Epoch 10/30
42/42 ━━━━━━━━━━━━━━━━━━━━ 0s 1ms/step - accuracy: 0.4792 - loss: 1.2282 - val_accuracy: 0.4841 - val_loss: 1.2606
Epoch 11/30
42/42 ━━━━━━━━━━━━━━━━━━━━ 0s 1ms/step - accuracy: 0.4627 - loss: 1.2766 - val_accuracy: 0.4818 - val_loss: 1.2406
Epoch 12/30
42/42 ━━━━━━━━━━━━━━━━━━━━ 0s 1ms/step - accuracy: 0.4986 - loss: 1.2991 - val_accuracy: 0.4932 - val_loss: 1.2345
Epoch 13/30
42/42 ━━━━━━━━━━━━━━━━━━━━ 0s 1ms/step - accuracy: 0.4868 - loss: 1.2399 - val_accuracy: 0.4591 - val_loss: 1.2583
Epoch 14/30
42/42 ━━━━━━━━━━━━━━━━━━━━ 0s 1ms/step - accuracy: 0.4662 - loss: 1.2786 - val_accuracy: 0.4750 - val_loss: 1.2593
Epoch 15/30
42/42 ━━━━━━━━━━━━━━━━━━━━ 0s 1ms/step - accuracy: 0.4867 - loss: 1.2257 - val_accuracy: 0.4841 - val_loss: 1.2203
Epoch 16/30
42/42 ━━━━━━━━━━━━━━━━━━━━ 0s 1ms/step - accuracy: 0.4908 - loss: 1.2727 - val_accuracy: 0.4455 - val_loss: 1.3006
Epoch 17/30
42/42 ━━━━━━━━━━━━━━━━━━━━ 0s 1ms/step - accuracy: 0.4782 - loss: 1.2941 - val_accuracy: 0.4750 - val_loss: 1.2212
Epoch 18/30
42/42 ━━━━━━━━━━━━━━━━━━━━ 0s 1ms/step - accuracy: 0.4706 - loss: 1.3481 - val_accuracy: 0.4909 - val_loss: 1.2204
Epoch 19/30
42/42 ━━━━━━━━━━━━━━━━━━━━ 0s 1ms/step - accuracy: 0.4933 - loss: 1.2386 - val_accuracy: 0.4727 - val_loss: 1.2090
Epoch 20/30
42/42 ━━━━━━━━━━━━━━━━━━━━ 0s 1ms/step - accuracy: 0.5180 - loss: 1.1936 - val_accuracy: 0.4841 - val_loss: 1.2058
Epoch 21/30
42/42 ━━━━━━━━━━━━━━━━━━━━ 0s 1ms/step - accuracy: 0.5077 - loss: 1.2327 - val_accuracy: 0.4795 - val_loss: 1.2079
Epoch 22/30
42/42 ━━━━━━━━━━━━━━━━━━━━ 0s 1ms/step - accuracy: 0.4789 - loss: 1.2774 - val_accuracy: 0.4773 - val_loss: 1.2189
Epoch 23/30
42/42 ━━━━━━━━━━━━━━━━━━━━ 0s 1ms/step - accuracy: 0.4861 - loss: 1.2328 - val_accuracy: 0.4818 - val_loss: 1.2130
Epoch 24/30
42/42 ━━━━━━━━━━━━━━━━━━━━ 0s 1ms/step - accuracy: 0.5063 - loss: 1.2276 - val_accuracy: 0.4659 - val_loss: 1.2262
Epoch 25/30
42/42 ━━━━━━━━━━━━━━━━━━━━ 0s 1ms/step - accuracy: 0.4865 - loss: 1.2454 - val_accuracy: 0.4886 - val_loss: 1.1995
Epoch 26/30
42/42 ━━━━━━━━━━━━━━━━━━━━ 0s 1ms/step - accuracy: 0.5258 - loss: 1.1798 - val_accuracy: 0.4886 - val_loss: 1.2104
Epoch 27/30
42/42 ━━━━━━━━━━━━━━━━━━━━ 0s 1ms/step - accuracy: 0.4742 - loss: 1.2574 - val_accuracy: 0.4909 - val_loss: 1.2127
Epoch 28/30
42/42 ━━━━━━━━━━━━━━━━━━━━ 0s 2ms/step - accuracy: 0.4778 - loss: 1.2222 - val_accuracy: 0.4705 - val_loss: 1.2176
Epoch 29/30
42/42 ━━━━━━━━━━━━━━━━━━━━ 0s 1ms/step - accuracy: 0.4811 - loss: 1.2224 - val_accuracy: 0.4591 - val_loss: 1.2598
Epoch 30/30
42/42 ━━━━━━━━━━━━━━━━━━━━ 0s 1ms/step - accuracy: 0.4828 - loss: 1.2104 - val_accuracy: 0.4955 - val_loss: 1.2198
Model: "sequential_1"
┏━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━┓
┃ Layer (type)                    ┃ Output Shape           ┃       Param # ┃
┡━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━┩
│ dense_4 (Dense)                 │ (None, 11)             │           132 │
├─────────────────────────────────┼────────────────────────┼───────────────┤
│ dense_5 (Dense)                 │ (None, 11)             │           132 │
├─────────────────────────────────┼────────────────────────┼───────────────┤
│ dense_6 (Dense)                 │ (None, 11)             │           132 │
├─────────────────────────────────┼────────────────────────┼───────────────┤
│ dense_7 (Dense)                 │ (None, 10)             │           120 │
└─────────────────────────────────┴────────────────────────┴───────────────┘
 Total params: 1,550 (6.06 KB)
 Trainable params: 516 (2.02 KB)
 Non-trainable params: 0 (0.00 B)
 Optimizer params: 1,034 (4.04 KB)
In [39]:
# accuracy metrics
history = fitout.history

acc = history['accuracy']
loss = history['loss']
val_acc = history['val_accuracy']
val_loss = history['val_loss']

print("final train accuracy: "+str(acc[-1]))
print("final train loss    : "+str(loss[-1]))

print("final val accuracy: "+str(val_acc[-1]))
print("final val loss    : "+str(val_loss[-1]))

print( datetime.datetime.now().strftime("%Y-%m-%d %H:%M:%S") + "  runtime: " + str(round(time.time()-starttime,3)) + " seconds" )
final train accuracy: 0.4856060743331909
final train loss    : 1.2303400039672852
final val accuracy: 0.4954545497894287
final val loss    : 1.219757080078125
2024-10-20 12:51:56  runtime: 1105.25 seconds
In [40]:
epochs_range = range(numepochs)

plt.figure(figsize=(10,5))

plt.subplot(1,2,1)
plt.plot( epochs_range, acc, label='Training Accuracy' )
plt.plot( epochs_range, val_acc, label='Validation Accuracy' )
plt.legend( loc='lower right' )
plt.ylim(0,1)
plt.title('Training and Validation Accuracy', fontsize=15 )

plt.subplot(1,2,2)
plt.plot( epochs_range, loss, label='Training Loss' )
plt.plot( epochs_range, val_loss, label='Validation Loss' )
plt.legend( loc='upper right' )
# plt.ylim(0,1)
plt.title('Training and Validation Loss', fontsize=15 )

plt.show()
[figure: training/validation accuracy and loss curves]
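
The validation loss plateaus around 1.2 after roughly 20 epochs. A hedged sketch of adding early stopping, not used in the run above, to avoid training past that point:

# stop when validation loss stops improving, keeping the best weights
early = tf.keras.callbacks.EarlyStopping(
    monitor='val_loss', patience=5, restore_best_weights=True)
fitout = ann.fit(X_train, y_train, validation_data=(X_test, y_test),
                 epochs=100, callbacks=[early])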

Run inference on test data to validate performance¶

In [41]:
# define a function to plot a confusion matrix
def plot_confusion_matrix(y_test, y_prediction):
    '''Plot a confusion matrix (assumes quality has been binned into three
    classes; not used for the multi-class labels below)'''
    cm = metrics.confusion_matrix(y_test, y_prediction)
    ax = plt.subplot()
    ax = sns.heatmap(cm, annot=True, fmt='', cmap="Blues")
    ax.set_xlabel('Predicted labels', fontsize=18)
    ax.set_ylabel('True labels', fontsize=18)
    ax.set_title('Confusion Matrix', fontsize=25)
    ax.xaxis.set_ticklabels(['Bad', 'Good', 'Middle'])
    ax.yaxis.set_ticklabels(['Bad', 'Good', 'Middle'])
    plt.show()
In [42]:
# define a function to plot a classification report
def clfr_plot(y_test, y_pred):
    '''Plot a classification report as a heatmap'''
    cr = pd.DataFrame(metrics.classification_report(y_test, y_pred, digits=3,
                                            output_dict=True)).T
    cr.drop(columns='support', inplace=True)
    sns.heatmap(cr, cmap='Blues', annot=True, linecolor='white', linewidths=0.5).xaxis.tick_top()
In [43]:
def clf_plot(y_test, y_pred):
    '''
    1) Plot the confusion matrix
    2) Plot the classification report
    '''

    # convert one-hot / probability vectors back to integer class labels
    y_predmax = np.argmax(y_pred, axis=1)
    y_testmax = np.argmax(y_test, axis=1)

    cm = metrics.confusion_matrix(y_testmax, y_predmax)
    cr = pd.DataFrame(metrics.classification_report(y_testmax, y_predmax, digits=3, output_dict=True)).T
    cr.drop(columns='support', inplace=True)

    fig, ax = plt.subplots(1, 2, figsize=(15, 5))

    # left axis: confusion matrix
    ax[0] = sns.heatmap(cm, annot=True, fmt='', cmap="Blues", ax=ax[0])
    ax[0].set_xlabel('Predicted labels', fontsize=18)
    ax[0].set_ylabel('True labels', fontsize=18)
    ax[0].set_title('Confusion Matrix', fontsize=25)

    # right axis: classification report
    ax[1] = sns.heatmap(cr, cmap='Blues', annot=True, linecolor='white', linewidths=0.5, ax=ax[1])
    ax[1].xaxis.tick_top()
    ax[1].set_title('Classification Report', fontsize=25)
    plt.show()
In [44]:
# test predict (inference)
y_pred = ann.predict(X_test)

# convert softmax probabilities and one-hot labels back to class indices
y_predmax = np.argmax(y_pred, axis=1)
y_testmax = np.argmax(y_test, axis=1)
cm = confusion_matrix(y_testmax, y_predmax)
print(cm)

clf_plot(y_test, y_pred)
14/14 ━━━━━━━━━━━━━━━━━━━━ 0s 3ms/step 
[[ 0  1  0  1  0  6  0  5]
 [ 0  1  0  1  0  2  0  2]
 [ 0  2  0  9  5  5  0  1]
 [ 0  0  0 96 50  5  0  9]
 [ 0  0  0 43 96 21  0  1]
 [ 0  3  0  1 20 22  0  7]
 [ 0  1  0  0  2  5  0  6]
 [ 0  2  0  0  1  5  0  3]]
/opt/anaconda3/lib/python3.12/site-packages/sklearn/metrics/_classification.py:1509: UndefinedMetricWarning: Precision is ill-defined and being set to 0.0 in labels with no predicted samples. Use `zero_division` parameter to control this behavior.
  _warn_prf(average, modifier, f"{metric.capitalize()} is", len(result))
/opt/anaconda3/lib/python3.12/site-packages/sklearn/metrics/_classification.py:1509: UndefinedMetricWarning: Precision is ill-defined and being set to 0.0 in labels with no predicted samples. Use `zero_division` parameter to control this behavior.
  _warn_prf(average, modifier, f"{metric.capitalize()} is", len(result))
/opt/anaconda3/lib/python3.12/site-packages/sklearn/metrics/_classification.py:1509: UndefinedMetricWarning: Precision is ill-defined and being set to 0.0 in labels with no predicted samples. Use `zero_division` parameter to control this behavior.
  _warn_prf(average, modifier, f"{metric.capitalize()} is", len(result))
[figure: confusion matrix and classification report heatmaps]
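
The UndefinedMetricWarning messages above come from quality classes that never receive a prediction; a hedged way to make that explicit and silence the warnings:

# report 0.0 precision for classes with no predicted samples, without warnings
print(metrics.classification_report(y_testmax, y_predmax, digits=3, zero_division=0))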
In [ ]: